NSF PAR Search | NSF Public Access Repository

Identification and applications of disease-associated differential human and bacterial proteins with metaproteomic evidence

https://doi.org/10.1007/s13755-025-00369-z

Canderan, Jamie; Stamboulian, Moses; Ye, Yuzhen (August 2025, Health Information Science and Systems)

The gut microbiome plays a fundamental role in human health and disease. Individual variations in the microbiome and the corresponding functional implications are key considerations to enhance precision health and medicine. Metaproteomics has recently revealed protein expression that might be associated with human health and disease. Existing studies focused on either human proteins or bacterial proteins that can be identified from (meta)proteomics data sets, but not both. In this study, we examined the feasibility of identifying both human and bacterial proteins that are differentially expressed between healthy and diseased individuals from metaproteomics data sets. We further evaluated different strategies of using identified peptides and proteins for building predictive models. By leveraging existing metaproteomics data sets and a tool that we have developed for metaproteomics data analysis (MetaProD), we were able to derive both human and bacterial differentially expressed proteins that could serve as potential biomarkers for all diseases we studied. We also built predictive models using identified peptides and proteins as features for prediction of human diseases. Our results showed that peptide-based identifications, rather than protein-based ones, often produce the most accurate models and that feature selection can offer improvements. Prediction accuracy could be further improved, in some cases, by including bacterial identifications, but missing data in bacterial identifications remains problematic.
Machine Learning in Small-Molecule Mass Spectrometry

https://doi.org/10.1146/annurev-anchem-071224-082157

Hong, Yuhui; Ye, Yuzhen; Tang, Haixu (May 2025, Annual Review of Analytical Chemistry)

Tandem mass spectrometry (MS/MS) is crucial for small-molecule analysis; however, traditional computational methods are limited by incomplete reference libraries and complex data processing. Machine learning (ML) is transforming small-molecule mass spectrometry in three key directions: (a) predicting MS/MS spectra and related physicochemical properties to expand reference libraries, (b) improving spectral matching through automated pattern extraction, and (c) predicting molecular structures of compounds directly from their MS/MS spectra. We review ML approaches for molecular representations [descriptors, simplified molecular-input line-entry system (SMILES) strings, and graphs] and MS/MS spectra representations (using binned vectors and peak lists) along with recent advances in spectra prediction, retention time, collision cross sections, and spectral matching. Finally, we discuss ML-integrated workflows for chemical formula identification. By addressing the limitations of current methods for compound identification, these ML approaches can greatly enhance the understanding of biological processes and the development of diagnostic and therapeutic tools.
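
The abstract above contrasts two MS/MS spectrum representations, peak lists and binned vectors. The following is a minimal sketch of converting one into the other. It is not taken from the review: the function name `bin_spectrum`, the 1.0 m/z bin width, the 1000 m/z upper limit, and the base-peak normalization are all illustrative assumptions.

```python
def bin_spectrum(peaks, max_mz=1000.0, bin_width=1.0):
    """Turn a peak list [(m/z, intensity), ...] into a fixed-length vector.

    Hypothetical helper for illustration; bin width and m/z range are
    assumed values, not parameters from any specific tool.
    """
    n_bins = int(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:          # peaks outside the range are dropped
            vector[idx] += intensity   # peaks sharing a bin are summed
    top = max(vector)
    if top > 0:                        # scale so the tallest bin equals 1.0
        vector = [v / top for v in vector]
    return vector


# A toy peak list: two peaks fall into bin 100, one lies beyond max_mz.
peaks = [(100.2, 50.0), (100.7, 25.0), (250.1, 150.0), (1200.0, 999.0)]
vec = bin_spectrum(peaks)
```

The binned vector has the same length for every spectrum, which suits models that need fixed-size input, while the peak list keeps exact m/z values that binning discards.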
Free, publicly-accessible full text available May 15, 2026
Identification of microbial species and proteins associated with colorectal cancer by reanalyzing CPTAC proteomic datasets

https://doi.org/10.1038/s41598-025-97984-3

Canderan, Jamie; Ye, Yuzhen (April 2025, Scientific Reports)
Incorporating metabolic activity, taxonomy and community structure to improve microbiome-based predictive models for host phenotype prediction

https://doi.org/10.1080/19490976.2024.2302076

Monshizadeh, Mahsa; Ye, Yuzhen (December 2024, Gut Microbes)

Full Text Available
Multitask knowledge-primed neural network for predicting missing metadata and host phenotype based on human microbiome

https://doi.org/10.1093/bioadv/vbae203

Monshizadeh, Mahsa; Hong, Yuhui; Ye, Yuzhen (December 2024, Bioinformatics Advances)
Lengauer, Thomas (Ed.)
Motivation: Microbial signatures in the human microbiome are closely associated with various human diseases, driving the development of machine learning models for microbiome-based disease prediction. Despite progress, challenges remain in enhancing prediction accuracy, generalizability, and interpretability. Confounding factors, such as the host’s gender, age, and body mass index, significantly influence the human microbiome, complicating microbiome-based predictions. Results: To address these challenges, we developed MicroKPNN-MT, a unified model for predicting human phenotype based on microbiome data, as well as additional metadata like age and gender. This model builds on our earlier MicroKPNN framework, which incorporates prior knowledge of microbial species into neural networks to enhance prediction accuracy and interpretability. In MicroKPNN-MT, metadata, when available, serves as additional input features for prediction. Otherwise, the model predicts metadata from microbiome data using additional decoders. We applied MicroKPNN-MT to microbiome data collected in mBodyMap, covering healthy individuals and 25 different diseases, and demonstrated its potential as a predictive tool for multiple diseases, which at the same time provided predictions for the missing metadata. Our results showed that incorporating real or predicted metadata helped improve the accuracy of disease predictions, and more importantly, helped improve the generalizability of the predictive models. Availability and implementation: https://github.com/mgtools/MicroKPNN-MT.
Full Text Available
Protein domain embeddings for fast and accurate similarity search

https://doi.org/10.1101/gr.279127.124

Iovino, Benjamin Giovanni; Tang, Haixu; Ye, Yuzhen (September 2024, Genome Research)

Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived via averaging the embeddings of individual residues, or applying matrix transformation techniques such as the discrete cosine transformation (DCT) to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins; however, limitations have been found; for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins with single domains but not multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut in short) in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. In addition, tests on a database search benchmark show that DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
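
The abstract above describes applying the DCT to a matrix of residue embeddings to obtain a fixed-dimension vector regardless of sequence length. The following is a simplified sketch of that general idea only, not the DCTdomain or PROST implementation: it applies an unnormalized DCT-II along the residue axis of each embedding dimension and keeps the first few low-frequency coefficients. The names `dct2` and `fixed_length_vector`, the `keep` parameter, and the toy 2-dimensional embeddings are assumptions made for illustration.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1D list of numbers."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def fixed_length_vector(residue_embeddings, keep=3):
    """Compress an (n_residues x n_dims) matrix into n_dims * keep numbers.

    Hypothetical helper: for each embedding dimension, transform the values
    along the sequence and keep only the `keep` lowest-frequency coefficients.
    """
    n_dims = len(residue_embeddings[0])
    out = []
    for d in range(n_dims):
        column = [row[d] for row in residue_embeddings]
        coeffs = dct2(column)
        coeffs += [0.0] * (keep - len(coeffs))  # pad very short sequences
        out.extend(coeffs[:keep])
    return out


# Two toy "proteins" of different lengths with 2-dimensional embeddings.
protein_a = [[0.1 * i, 1.0] for i in range(5)]
protein_b = [[0.1 * i, 1.0] for i in range(8)]
vec_a = fixed_length_vector(protein_a)
vec_b = fixed_length_vector(protein_b)
```

Both vectors have six entries even though the inputs have five and eight residues. Because the transform here is unnormalized, coefficient magnitudes grow with sequence length; a practical pipeline would likely normalize or quantize the result before comparing vectors.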
Full Text Available
SpecEncoder: deep metric learning for accurate peptide identification in proteomics

https://doi.org/10.1093/bioinformatics/btae220

Liu, Kaiyuan; Tao, Chenghua; Ye, Yuzhen; Tang, Haixu (June 2024, Bioinformatics)

Motivation: Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra; however, both may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification. Results: We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%–2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%–15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%–12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder can also identify more peptides when compared to deep-learning enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder’s potential to enhance peptide identification for proteomic data analyses. Availability and implementation: The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.
Accurate de novo peptide sequencing using fully convolutional neural networks

https://doi.org/10.1038/s41467-023-43010-x

Liu, Kaiyuan; Ye, Yuzhen; Li, Sujun; Tang, Haixu (December 2023, Nature Communications)

De novo peptide sequencing, which does not rely on a comprehensive target sequence database, provides us with a way to identify novel peptides from tandem mass spectra. However, current de novo sequencing algorithms suffer from low accuracy and coverage, which hinders their application in proteomics. In this paper, we present PepNet, a fully convolutional neural network for high accuracy de novo peptide sequencing. PepNet takes an MS/MS spectrum (represented as a high-dimensional vector) as input, and outputs the optimal peptide sequence along with its confidence score. The PepNet model is trained using a total of 3 million high-energy collisional dissociation MS/MS spectra from multiple human peptide spectral libraries. Evaluation results show that PepNet significantly outperforms current best-performing de novo sequencing algorithms (e.g. PointNovo and DeepNovo) in both peptide-level accuracy and positional-level accuracy. PepNet can sequence a large fraction of spectra that were not identified by database search engines, and thus could be used as a complementary tool to database search engines for peptide identification in proteomics. In addition, PepNet runs around 3x and 7x faster than PointNovo and DeepNovo on GPUs, respectively, thus being more suitable for the analysis of large-scale proteomics data.
3DMolMS: prediction of tandem mass spectra from 3D molecular conformations

https://doi.org/10.1093/bioinformatics/btad354

Hong, Yuhui; Li, Sujun; Welch, Christopher J.; Tichy, Shane; Ye, Yuzhen; Tang, Haixu (May 2023, Bioinformatics)
Elofsson, Arne (Ed.)

Motivation: Tandem mass spectrometry is an essential technology for characterizing chemical compounds at high sensitivity and throughput, and is commonly adopted in many fields. However, computational methods for automated compound identification from their MS/MS spectra are still limited, especially for novel compounds that have not been previously characterized. In recent years, in silico methods were proposed to predict the MS/MS spectra of compounds, which can then be used to expand the reference spectral libraries for compound identification. However, these methods did not consider the compounds’ 3D conformations, and thus neglected critical structural information. Results: We present the 3D Molecular Network for Mass Spectra Prediction (3DMolMS), a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. We evaluated the model on the experimental spectra collected in several spectral libraries. The results showed that 3DMolMS predicted the spectra with average cosine similarities of 0.691 and 0.478 with the experimental MS/MS spectra acquired in positive and negative ion modes, respectively. Furthermore, the 3DMolMS model can be generalized to the prediction of MS/MS spectra acquired by different labs on different instruments through minor fine-tuning on a small set of spectra. Finally, we demonstrate that the molecular representation learned by 3DMolMS from MS/MS spectra prediction can be adapted to enhance the prediction of chemical properties such as the elution time in liquid chromatography and the collisional cross section measured by ion mobility spectrometry, both of which are often used to improve compound identification. Availability and implementation: The codes of 3DMolMS are available at https://github.com/JosieHong/3DMolMS and the web service is at https://spectrumprediction.gnps2.org.
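
The abstract above reports prediction quality as the cosine similarity between predicted and experimental spectra. As a hedged illustration of that metric alone, the sketch below computes cosine similarity between two intensity vectors that are assumed to be already binned onto the same m/z grid. The function name and the toy vectors are hypothetical, and any preprocessing used in the 3DMolMS evaluation (intensity scaling, binning resolution) is not reproduced here.

```python
import math


def cosine_similarity(predicted, experimental):
    """Cosine of the angle between two equal-length intensity vectors."""
    if len(predicted) != len(experimental):
        raise ValueError("spectra must be binned onto the same m/z grid")
    dot = sum(p * e for p, e in zip(predicted, experimental))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_e = math.sqrt(sum(e * e for e in experimental))
    if norm_p == 0.0 or norm_e == 0.0:
        return 0.0  # convention chosen here for an empty spectrum
    return dot / (norm_p * norm_e)


# Toy binned spectra (assumed values, four bins each).
same = cosine_similarity([0.0, 1.0, 0.5, 0.0], [0.0, 1.0, 0.5, 0.0])
disjoint = cosine_similarity([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0])
partial = cosine_similarity([1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0])
```

For non-negative intensities the score ranges from 0 (no shared peaks) to 1 (identical relative intensities), and it does not depend on the overall intensity scale of either spectrum.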
Locality-Sensitive Hashing-Based k-Mer Clustering for Identification of Differential Microbial Markers Related to Host Phenotype

https://doi.org/10.1089/cmb.2021.0640

Han, Wontack; Tang, Haixu; Ye, Yuzhen (July 2022, Journal of Computational Biology)

Full Text Available
